Using Association Rules of the Online Retail Dataset
1 Load Data
We first want to load our datasets and prepare them for some simple association rules mining.
tnx_data_tbl <- read_rds("data/retail_data_cleaned_tbl.rds")
tnx_data_tbl %>% glimpse()## Rows: 1,021,424
## Columns: 23
## $ row_id <chr> "ROW0000001", "ROW0000002", "ROW0000003", "ROW000000…
## $ excel_sheet <chr> "Year 2009-2010", "Year 2009-2010", "Year 2009-2010"…
## $ invoice_id <chr> "489434", "489434", "489434", "489434", "489434", "4…
## $ stock_code <chr> "85048", "79323P", "79323W", "22041", "21232", "2206…
## $ description <chr> "15CM CHRISTMAS GLASS BALL 20 LIGHTS", "PINK CHERRY …
## $ quantity <dbl> 12, 12, 12, 48, 24, 24, 24, 10, 12, 12, 24, 12, 10, …
## $ invoice_date <date> 2009-12-01, 2009-12-01, 2009-12-01, 2009-12-01, 200…
## $ price <dbl> 6.95, 6.75, 6.75, 2.10, 1.25, 1.65, 1.25, 5.95, 2.55…
## $ customer_id <chr> "13085", "13085", "13085", "13085", "13085", "13085"…
## $ country <chr> "United Kingdom", "United Kingdom", "United Kingdom"…
## $ stock_code_upr <chr> "85048", "79323P", "79323W", "22041", "21232", "2206…
## $ cancellation <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ invoice_dttm <dttm> 2009-12-01 07:45:00, 2009-12-01 07:45:00, 2009-12-0…
## $ invoice_month <chr> "December", "December", "December", "December", "Dec…
## $ invoice_dow <chr> "Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday…
## $ invoice_dom <chr> "01", "01", "01", "01", "01", "01", "01", "01", "01"…
## $ invoice_hour <chr> "07", "07", "07", "07", "07", "07", "07", "07", "07"…
## $ invoice_minute <chr> "45", "45", "45", "45", "45", "45", "45", "45", "45"…
## $ invoice_woy <chr> "49", "49", "49", "49", "49", "49", "49", "49", "49"…
## $ invoice_ym <chr> "200912", "200912", "200912", "200912", "200912", "2…
## $ stock_value <dbl> 83.40, 81.00, 81.00, 100.80, 30.00, 39.60, 30.00, 59…
## $ invoice_monthprop <dbl> 0.04347826, 0.04347826, 0.04347826, 0.04347826, 0.04…
## $ exclude <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
To use our rules mining we just need the invoice data and the stock code, so we can ignore the rest. Also, we ignore the issue of returns and just look at purchases.
tnx_purchase_tbl <- tnx_data_tbl %>%
filter(
quantity > 0,
price > 0,
exclude == FALSE
) %>%
select(
invoice_id, stock_code, customer_id, quantity, price, stock_value,
description
)
tnx_purchase_tbl %>% glimpse()## Rows: 992,023
## Columns: 7
## $ invoice_id <chr> "489434", "489434", "489434", "489434", "489434", "489434"…
## $ stock_code <chr> "85048", "79323P", "79323W", "22041", "21232", "22064", "2…
## $ customer_id <chr> "13085", "13085", "13085", "13085", "13085", "13085", "130…
## $ quantity <dbl> 12, 12, 12, 48, 24, 24, 24, 10, 12, 12, 24, 12, 10, 18, 3,…
## $ price <dbl> 6.95, 6.75, 6.75, 2.10, 1.25, 1.65, 1.25, 5.95, 2.55, 3.75…
## $ stock_value <dbl> 83.40, 81.00, 81.00, 100.80, 30.00, 39.60, 30.00, 59.50, 3…
## $ description <chr> "15CM CHRISTMAS GLASS BALL 20 LIGHTS", "PINK CHERRY LIGHTS…
We now write this data out as a CSV so arules can read it in and process it.
tnx_purchase_tbl %>% write_csv("data/tnx_purchase_tbl.csv")We also want to load the free-text description of the various stock items as this will help will interpretation of the various rules we have found.
stock_description_tbl <- read_rds("data/stock_description_tbl.rds")
stock_description_tbl %>% glimpse()## Rows: 4,725
## Columns: 2
## $ stock_code <chr> "10002", "10002R", "10080", "10109", "10120", "10123C", "10…
## $ desc <chr> "INFLATABLE POLITICAL GLOBE", "ROBOT PENCIL SHARPNER", "GRO…
2 Basket Analysis with Association Rules
We now want to do some basic basket analysis using association rules, which tries to determine which items get bought together, similar to taking a graph approachin many ways.
basket_arules <- read.transactions(
file = "data/tnx_purchase_tbl.csv",
format = "single",
sep = ",",
header = TRUE,
cols = c("invoice_id", "stock_code")
)
basket_arules %>% glimpse()## Formal class 'transactions' [package "arules"] with 3 slots
## ..@ data :Formal class 'ngCMatrix' [package "Matrix"] with 5 slots
## ..@ itemInfo :'data.frame': 4895 obs. of 1 variable:
## .. ..$ labels: chr [1:4895] "10002" "10002R" "10080" "10109" ...
## ..@ itemsetInfo:'data.frame': 39510 obs. of 1 variable:
## .. ..$ transactionID: chr [1:39510] "489434" "489435" "489436" "489437" ...
Now that we have this data we can look at some basic plots much like we produced before. For example, we can look at the relative frequency of the different items.
itemFrequencyPlot(basket_arules, topN = 20)itemFrequencyPlot(basket_arules, topN = 20, type = "absolute")The stock codes do not mean a huge amount to us, so we also want to look at the description field for these items.
freq_codes <- itemFrequency(basket_arules) %>%
sort(decreasing = TRUE) %>%
head(20) %>%
names()
stock_description_tbl %>%
filter(stock_code %in% freq_codes) %>%
arrange(stock_code) %>%
datatable()2.1 Basic Concepts
The basic ideas of association rule mining and basket analysis draws on basic ideas from probability theory.
We speak in terms of the itemset: that is, a collection of one or more items that co-occur in a transaction.
For example, suppose we have a list of transactions as follows:
| ID | Items |
|---|---|
| 1 | milk, bread |
| 2 | bread, butter |
| 3 | beer |
| 4 | milk, bread, butter |
| 5 | bread, butter |
Using the above set of transactions, and itemset may be “milk” or “bread, butter”.
The support of an itemset \(X\), \(\text{Supp}(X)\), is defined as the proportion of transactions in the dataset which contain the itemset.
In the above example:
\[ \text{Supp}(\text{\{milk, bread\}}) = \frac{2}{5} = 0.40. \]
A rule, \(X \Rightarrow Y\), between two itemsets \(X\) and \(Y\) is a directed relationship of the itemset \(X\) showing the presence of \(Y\). The rule is not symmetric: \(X \Rightarrow Y\) and \(Y \Rightarrow X\) are not the same.
The confidence for the rule \(X \implies Y\), \(\text{Conf}(X \Rightarrow Y)\) is defined by
\[ \text{Conf}(X \Rightarrow Y) = \frac{\text{Supp}(X \cup Y)}{\text{Supp}(X)}. \]
So, to calculate the confidence for a rule:
\[ \text{Conf}(\text{\{milk, bread\}} \Rightarrow \text{\{butter\}}) = \frac{0.2}{0.4} = 0.5. \]
To illustrate how rules are not symmetric:
\[ \text{Conf}(\text{\{butter\}} \Rightarrow \text{\{milk, bread\}}) = \frac{0.2}{0.6} = 0.33. \]
Finally, we want a measure of the strength of the relationship between the itemsets \(X\) and \(Y\). That is, measuring the effect of the presence of \(X\) on the presence of \(Y\). We measure this by defining the lift of a rule as
\[ \text{Lift}(X \Rightarrow Y) = \frac{\text{Supp}(X \cup Y)}{\text{Supp}(X) \text{Supp}(Y)}. \]
Again, we repeat our calculations for our rule.
\[ \text{Lift}(\text{\{bread, milk\}} \Rightarrow \text{\{butter\}}) = \frac{0.2}{(0.4)(0.6)} = \frac{0.2}{0.24} = 0.8333 \]
Lift values greater than 1 implies the presence of \(X\) increases the probability of \(Y\) being present when compared to the unconditional probability.
Now that we have these metrics and concepts, we can turn our attention to trying to find rules in a given dataset, using these metrics to rank them.
Rather than using brute-force approaches to discovering these rules, we use a number of different algorithms to find associations within the dataset.
The two main algorithms for discovering some rules are the apriori and the
eclat algorithms.
2.2 Construct apriori Rules
We now want to construct the association rules using the apriori algorithm.
To do this, we need to set parameters such as the minimum support and the
minimum confidence level.
This gives us a set of association rules, along with the support and lift.
basket_apriori <- apriori(
basket_arules,
parameter = list(supp = 0.005, conf = 0.8)
)## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.005 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 197
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[4895 item(s), 39510 transaction(s)] done [0.34s].
## sorting and recoding items ... [1445 item(s)] done [0.03s].
## creating transaction tree ... done [0.02s].
## checking subsets of size 1 2 3 4 5 6 done [0.19s].
## writing ... [537 rule(s)] done [0.00s].
## creating S4 object ... done [0.01s].
basket_apriori_tbl <- basket_apriori %>%
as("data.frame") %>%
as_tibble() %>%
arrange(desc(lift))
basket_apriori_tbl %>% glimpse()## Rows: 537
## Columns: 6
## $ rules <chr> "{22917,22918} => {22916}", "{22916,22919} => {22917}", "{2…
## $ support <dbl> 0.005391040, 0.005188560, 0.005391040, 0.005087320, 0.00528…
## $ confidence <dbl> 0.9466667, 0.9579439, 0.9508929, 0.9392523, 0.9500000, 0.93…
## $ coverage <dbl> 0.005694761, 0.005416350, 0.005669451, 0.005416350, 0.00556…
## $ lift <dbl> 150.2120, 150.1919, 149.0864, 149.0356, 148.9464, 148.7131,…
## $ count <int> 213, 205, 213, 201, 209, 209, 205, 204, 204, 203, 203, 232,…
We now want to inspect this table using the ruleExplorer()
basket_apriori %>% ruleExplorer()To help visualise these rules, we can produce a basic scatterplot of the metrics.
ggplot(basket_apriori_tbl) +
geom_point(aes(x = confidence, y = lift), alpha = 0.2) +
xlab("Rule Confidence") +
ylab("Rule Lift") +
ggtitle("Scatterplot of Association Rule Metrics")2.3 Construct eclat Rules
An alternative method of constructing association rules is to use the eclat
algorithm. The code for doing this is slightly different, but gives us similar
outputs.
basket_eclat <- eclat(
basket_arules,
parameter = list(support = 0.005)
) %>%
ruleInduction(
basket_arules,
confidence = 0.8
)## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.005 1 10 frequent itemsets TRUE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 197
##
## create itemset ...
## set transactions ...[4895 item(s), 39510 transaction(s)] done [0.33s].
## sorting and recoding items ... [1445 item(s)] done [0.02s].
## creating sparse bit matrix ... [1445 row(s), 39510 column(s)] done [0.01s].
## writing ... [7742 set(s)] done [4.12s].
## Creating S4 object ... done [0.00s].
basket_eclat_tbl <- basket_eclat %>%
as("data.frame") %>%
as_tibble() %>%
arrange(desc(lift))
basket_eclat_tbl %>% glimpse()## Rows: 537
## Columns: 5
## $ rules <chr> "{22917,22918} => {22916}", "{22916,22919} => {22917}", "{2…
## $ support <dbl> 0.005391040, 0.005188560, 0.005391040, 0.005087320, 0.00528…
## $ confidence <dbl> 0.9466667, 0.9579439, 0.9508929, 0.9392523, 0.9500000, 0.93…
## $ lift <dbl> 150.2120, 150.1919, 149.0864, 149.0356, 148.9464, 148.7131,…
## $ itemset <int> 69, 41, 69, 39, 73, 73, 41, 26, 26, 38, 68, 76, 76, 40, 69,…
Once again, we inspect the data using ruleExplorer()
basket_eclat %>% ruleExplorer()2.4 Compare Algorithms
We now want to compare the outputs of both algorithms in terms of association rules and how they compare.
basket_ap_tbl <- basket_apriori_tbl %>%
select(rules, support, confidence, lift)
basket_ec_tbl <- basket_eclat_tbl %>%
select(rules, support, confidence, lift)
rules_comparison_tbl <- basket_ap_tbl %>%
full_join(basket_ec_tbl, by = "rules", suffix = c("_a", "_e"))
rules_comparison_tbl %>% glimpse()## Rows: 537
## Columns: 7
## $ rules <chr> "{22917,22918} => {22916}", "{22916,22919} => {22917}", "…
## $ support_a <dbl> 0.005391040, 0.005188560, 0.005391040, 0.005087320, 0.005…
## $ confidence_a <dbl> 0.9466667, 0.9579439, 0.9508929, 0.9392523, 0.9500000, 0.…
## $ lift_a <dbl> 150.2120, 150.1919, 149.0864, 149.0356, 148.9464, 148.713…
## $ support_e <dbl> 0.005391040, 0.005188560, 0.005391040, 0.005087320, 0.005…
## $ confidence_e <dbl> 0.9466667, 0.9579439, 0.9508929, 0.9392523, 0.9500000, 0.…
## $ lift_e <dbl> 150.2120, 150.1919, 149.0864, 149.0356, 148.9464, 148.713…
2.5 Reducing Minimum Confidence
While high confidence rules are useful, they are more likely to find rules that are “obvious” as the probabilities are such that co-occuring basket items will be noticed as being together - or possibly be natural complements: butter, milk and bread is a good example.
Instead, we are also interested in less obvious rules, and so we reduce our confidence threshold and see how many additional rules are discovered.
basket_lower_rules <- apriori(
basket_arules,
parameter = list(supp = 0.005, conf = 0.4)
)## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.4 0.1 1 none FALSE TRUE 5 0.005 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 197
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[4895 item(s), 39510 transaction(s)] done [0.33s].
## sorting and recoding items ... [1445 item(s)] done [0.03s].
## creating transaction tree ... done [0.02s].
## checking subsets of size 1 2 3 4 5 6 done [0.18s].
## writing ... [5950 rule(s)] done [0.01s].
## creating S4 object ... done [0.01s].
basket_lower_rules_tbl <- basket_lower_rules %>%
as("data.frame") %>%
as_tibble() %>%
arrange(desc(lift))ggplot(basket_lower_rules_tbl) +
geom_point(aes(x = confidence, y = lift), alpha = 0.2) +
xlab("Rule Confidence") +
ylab("Rule Lift") +
ggtitle("Scatterplot of Association Rule Metrics")3 Converting Rules to Graphs
We also have the ability to convert these rules to a graph representation,
where each node is either a stock_code or a rule, with the edges of the
graph representing that item being contained in the rule.
apriori_rules_igraph <- basket_apriori %>%
plot(
measure = "support",
method = "graph",
engine = "igraph",
control = list(max = 1000)
)apriori_rules_igraph %>% print()## IGRAPH 28a2af2 DN-B 679 2001 --
## + attr: name (v/c), label (v/c), type (v/n), support (v/n), confidence
## | (v/n), coverage (v/n), lift (v/n), count (v/n), order (v/n)
## + edges from 28a2af2 (vertex names):
## [1] 1732->assoc1 1255->assoc2 2120->assoc3 2117->assoc4 2120->assoc5
## [6] 2115->assoc6 2120->assoc7 2118->assoc8 2117->assoc9 2118->assoc10
## [11] 2119->assoc11 2118->assoc12 2115->assoc13 2118->assoc14 2116->assoc15
## [16] 2117->assoc16 2119->assoc17 2117->assoc18 2115->assoc19 2117->assoc20
## [21] 2116->assoc21 2119->assoc22 2115->assoc23 2119->assoc24 2116->assoc25
## [26] 2115->assoc26 2116->assoc27 2360->assoc28 2360->assoc29 1950->assoc30
## [31] 1950->assoc31 1950->assoc32 3748->assoc33 2359->assoc34 1949->assoc35
## + ... omitted several edges
We should first visualise this graph, using the top 100 rules in the dataset, as measured by the support of the rule.
basket_apriori %>%
head(n = 100, by = "support") %>%
plot(
measure = "lift",
method = "graph",
engine = "htmlwidget"
)3.1 Extract Connected Product Labels
First we want to look at the different disjoint components of the graph, and label them with an ID.
apriori_rules_tblgraph <- apriori_rules_igraph %>%
igraph::as.undirected(mode = "collapse") %>%
as_tbl_graph() %>%
mutate(
component_id = group_components()
) %>%
group_by(component_id) %>%
mutate(
component_size = n()
) %>%
ungroup()We then want to create groups of common products that form a disjoint cluster within this graph.
product_groups_all_tbl <- apriori_rules_tblgraph %>%
activate(nodes) %>%
as_tibble() %>%
filter(are_na(support)) %>%
group_by(component_id) %>%
mutate(
product_count = n()
) %>%
ungroup() %>%
select(
product_group_id = component_id,
product_count,
stock_code = label
) %>%
arrange(product_group_id, stock_code)
product_groups_all_tbl %>% glimpse()## Rows: 142
## Columns: 3
## $ product_group_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ product_count <int> 67, 67, 67, 67, 67, 67, 67, 67, 67, 67, 67, 67, 67, 6…
## $ stock_code <chr> "20711", "20712", "20713", "20718", "20719", "20723",…
For display purposes, we can show all the stock_id values in a list.
3.1.1 Cluster Larger Groups
Within the large disjoint cluster there are a large number of products so rather than treating this as a single group we instead may investigate using further graph clustering algorithms to create further groupings.
apriori_rules_large_tblgraph <- apriori_rules_tblgraph %>%
to_subgraph(component_size == max(component_size)) %>%
use_series(subgraph) %>%
morph(to_undirected) %>%
mutate(
sub_id = group_louvain()
) %>%
unmorph()Now that we have sub-divided this large subgraph, we repeat the process.
product_groups_largest_tbl <- apriori_rules_large_tblgraph %>%
activate(nodes) %>%
as_tibble() %>%
filter(are_na(support)) %>%
group_by(sub_id) %>%
mutate(
product_count = n()
) %>%
ungroup() %>%
select(
product_group_id = sub_id, product_count, stock_code = label
) %>%
arrange(product_group_id, stock_code)
product_groups_largest_tbl %>% glimpse()## Rows: 67
## Columns: 3
## $ product_group_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2,…
## $ product_count <int> 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 16, 1…
## $ stock_code <chr> "20719", "20723", "20724", "20725", "20727", "20728",…
Finally, it is worth trying to use an interactive tool to investigate this
subgraph, we we can use visNetwork() to produce an interactive JS tool
apriori_rules_large_tblgraph %>%
toVisNetworkData(idToLabel = FALSE) %>%
visNetwork(
nodes = .$nodes %>% transmute(id, label, group = sub_id),
edges = .$edges
)3.2 Evaluating Product Groups
How do we go about assessing the validity of these product groups?
Note that this work is exploratory - in effect this is more sophisticated data exploration. Rather than use this model to make predictions - a job we will need to do at some point, we instead just want to assess how novel these grouping are.
To that end, it may be useful to check the co-occurrence of these products as a group - for each purchase made by a customer, what proportion of the group was featured in this data?
This question is worth exploring, so we should write some code to assess this.
Before we do this, we combine our two lists of product groups into a single table.
product_groups_tbl <- list(
ALL = product_groups_all_tbl,
LRG = product_groups_largest_tbl
) %>%
bind_rows(.id = "type") %>%
mutate(
group_label = sprintf("%s_%02d", type, product_group_id)
) %>%
group_by(group_label) %>%
mutate(
group_size = n()
) %>%
ungroup() %>%
select(group_label, group_size, stock_code)
product_groups_tbl %>% glimpse()## Rows: 209
## Columns: 3
## $ group_label <chr> "ALL_01", "ALL_01", "ALL_01", "ALL_01", "ALL_01", "ALL_01"…
## $ group_size <int> 67, 67, 67, 67, 67, 67, 67, 67, 67, 67, 67, 67, 67, 67, 67…
## $ stock_code <chr> "20711", "20712", "20713", "20718", "20719", "20723", "207…
tnx_groups_tbl <- tnx_data_tbl %>%
select(invoice_id, invoice_date, stock_code) %>%
group_nest(invoice_id, .key = "invoice_data")
group_props_tbl <- product_groups_tbl %>%
group_nest(group_label, group_size, .key = "stock_data") %>%
filter(group_size > 1, group_size < 15) %>%
expand_grid(tnx_groups_tbl) %>%
mutate(
comb_data = future_map2(
invoice_data, stock_data,
inner_join,
by = "stock_code",
.options = furrr_options(globals = FALSE)
),
match_count = map_int(comb_data, nrow),
group_prop = match_count / group_size
) %>%
select(group_label, group_size, group_prop) %>%
filter(group_prop > 0)
group_props_tbl %>% glimpse()## Rows: 54,147
## Columns: 3
## $ group_label <chr> "ALL_02", "ALL_02", "ALL_02", "ALL_02", "ALL_02", "ALL_02"…
## $ group_size <int> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10…
## $ group_prop <dbl> 0.2, 0.6, 0.3, 0.3, 0.1, 0.7, 0.2, 0.4, 0.2, 0.4, 0.1, 0.2…
We now create a histogram of the proportions for each group, and this gives us a gauge of the ‘novelty’ of each of these groups.
plot_tbl <- group_props_tbl %>%
mutate(label = glue("{group_label} ({group_size})"))
ggplot(plot_tbl) +
geom_histogram(aes(x = group_prop), binwidth = 0.1) +
facet_wrap(vars(label), scales = "free_y") +
scale_y_continuous(labels = label_comma()) +
xlab("Proportion") +
ylab("Purchase Count") +
ggtitle("Facetted Histograms of Group Coverages by Product Grouping") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5))3.2.1 Investigate Groups
Now that we have our groups we add on the description fields so that interpretation of the different groupings is easier.
product_groups_tbl %>%
filter(group_size > 1, group_size < 15) %>%
mutate(stock_code = stock_code %>% str_trim() %>% str_to_upper()) %>%
left_join(stock_description_tbl, by = "stock_code") %>%
datatable()4 Investigate Lower Support Rules
Our previous analysis was all based on rules with a minimum confidence of 0.80 so we now want to repeat our analysis but on this new set of rules.
apriori_lower_rules_igraph <- basket_lower_rules %>%
plot(
measure = "support",
method = "graph",
engine = "igraph",
control = list(max = 5000)
)apriori_lower_rules_igraph %>% print()## IGRAPH 11f8463 DN-B 5460 15770 --
## + attr: name (v/c), label (v/c), type (v/n), support (v/n), confidence
## | (v/n), coverage (v/n), lift (v/n), count (v/n), order (v/n)
## + edges from 11f8463 (vertex names):
## [1] 249 ->assoc1 1608->assoc1 567 ->assoc2 1582->assoc2 567 ->assoc3
## [6] 616 ->assoc3 1223->assoc4 1606->assoc4 259 ->assoc5 1223->assoc5
## [11] 261 ->assoc6 1223->assoc6 261 ->assoc7 1608->assoc7 1223->assoc8
## [16] 1582->assoc8 258 ->assoc9 1223->assoc9 4121->assoc10 4127->assoc10
## [21] 616 ->assoc11 4121->assoc11 1247->assoc12 4121->assoc12 616 ->assoc13
## [26] 4121->assoc13 1903->assoc14 2159->assoc14 1901->assoc15 2159->assoc15
## [31] 249 ->assoc16 4278->assoc16 249 ->assoc17 1608->assoc17 1331->assoc18
## + ... omitted several edges
Once again we have a quick look at the top 50 rules.
basket_lower_rules %>%
head(n = 50, by = "support") %>%
plot(
measure = "lift",
method = "graph",
engine = "htmlwidget"
)4.1 Determine Distinct Rules Subgraphs
Having converted the association rules to the graph, we then look at the distinct components of this graph and use these as our first pass at these clusters.
apriori_lower_rules_tblgraph <- apriori_lower_rules_igraph %>%
igraph::as.undirected(mode = "collapse") %>%
as_tbl_graph() %>%
mutate(
component_id = group_components()
) %>%
group_by(component_id) %>%
mutate(
component_size = n()
) %>%
ungroup()
apriori_lower_rules_tblgraph %>% print()## # A tbl_graph: 5460 nodes and 15770 edges
## #
## # A bipartite simple graph with 88 components
## #
## # Node Data: 5,460 × 11 (active)
## name label type support confidence coverage lift count order component_id
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int> <int>
## 1 25 1505… 1 NA NA NA NA NA NA 14
## 2 27 1505… 1 NA NA NA NA NA NA 14
## 3 29 1505… 1 NA NA NA NA NA NA 14
## 4 217 20674 1 NA NA NA NA NA NA 2
## 5 218 20675 1 NA NA NA NA NA NA 2
## 6 219 20676 1 NA NA NA NA NA NA 2
## # … with 5,454 more rows, and 1 more variable: component_size <int>
## #
## # Edge Data: 15,770 × 2
## from to
## <int> <int>
## 1 11 461
## 2 217 461
## 3 218 461
## # … with 15,767 more rows
product_groups_lower_all_tbl <- apriori_lower_rules_tblgraph %>%
activate(nodes) %>%
as_tibble() %>%
filter(are_na(support)) %>%
group_by(component_id) %>%
mutate(
product_count = n()
) %>%
ungroup() %>%
select(product_group_id = component_id, product_count, stock_code = label) %>%
arrange(product_group_id, stock_code)
product_groups_lower_all_tbl %>% glimpse()## Rows: 460
## Columns: 3
## $ product_group_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ product_count <int> 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216…
## $ stock_code <chr> "20711", "20712", "20713", "20717", "20718", "20719",…
apriori_lower_rules_bigcomp_tblgraph <- apriori_lower_rules_tblgraph %>%
to_subgraph(component_size == max(component_size)) %>%
use_series(subgraph) %>%
mutate(
sub_id = group_louvain()
)
apriori_lower_rules_bigcomp_tblgraph %>% print()## # A tbl_graph: 4456 nodes and 13807 edges
## #
## # A bipartite simple graph with 1 component
## #
## # Node Data: 4,456 × 12 (active)
## name label type support confidence coverage lift count order component_id
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int> <int>
## 1 247 20711 1 NA NA NA NA NA NA 1
## 2 248 20712 1 NA NA NA NA NA NA 1
## 3 249 20713 1 NA NA NA NA NA NA 1
## 4 253 20717 1 NA NA NA NA NA NA 1
## 5 254 20718 1 NA NA NA NA NA NA 1
## 6 255 20719 1 NA NA NA NA NA NA 1
## # … with 4,450 more rows, and 2 more variables: component_size <int>,
## # sub_id <int>
## #
## # Edge Data: 13,807 × 2
## from to
## <int> <int>
## 1 3 217
## 2 98 217
## 3 99 217
## # … with 13,804 more rows
product_groups_lower_bigcomp_tbl <- apriori_lower_rules_bigcomp_tblgraph %>%
activate(nodes) %>%
as_tibble() %>%
filter(are_na(support)) %>%
group_by(sub_id) %>%
mutate(
product_count = n()
) %>%
ungroup() %>%
select(product_group_id = sub_id, product_count, stock_code = label) %>%
arrange(product_group_id, stock_code)
product_groups_lower_bigcomp_tbl %>% glimpse()## Rows: 216
## Columns: 3
## $ product_group_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ product_count <int> 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 2…
## $ stock_code <chr> "20711", "20712", "20713", "20717", "20718", "21033",…
product_groups_lower_tbl <- list(
ALL = product_groups_lower_all_tbl,
LRG = product_groups_lower_bigcomp_tbl
) %>%
bind_rows(.id = "type") %>%
mutate(
group_label = sprintf("%s_%02d", type, product_group_id)
) %>%
group_by(group_label) %>%
mutate(
group_size = n()
) %>%
ungroup() %>%
select(group_label, group_size, stock_code)
product_groups_lower_tbl %>% glimpse()## Rows: 676
## Columns: 3
## $ group_label <chr> "ALL_01", "ALL_01", "ALL_01", "ALL_01", "ALL_01", "ALL_01"…
## $ group_size <int> 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216…
## $ stock_code <chr> "20711", "20712", "20713", "20717", "20718", "20719", "207…
Now that we have our groups we add on the description fields so that interpretation of the different groupings is easier.
product_groups_lower_tbl %>%
filter(group_size > 1, group_size != max(group_size)) %>%
mutate(stock_code = stock_code %>% str_trim() %>% str_to_upper()) %>%
left_join(stock_description_tbl, by = "stock_code") %>%
datatable()5 Output Data to Disk
We now want to write the various data groups to disk.
As this may be useful for later analysis and for later modelling, we output these groupings for later use.
product_groups_tbl %>% write_rds("data/product_groups_tbl.rds")
product_groups_lower_tbl %>% write_rds("data/product_groups_lower_tbl.rds")6 R Environment
sessioninfo::session_info()## ─ Session info ───────────────────────────────────────────────────────────────
## setting value
## version R version 4.1.0 (2021-05-18)
## os Ubuntu 20.04.3 LTS
## system x86_64, linux-gnu
## ui X11
## language (EN)
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz Etc/UTC
## date 2021-10-12
##
## ─ Packages ───────────────────────────────────────────────────────────────────
## package * version date lib source
## arules * 1.6-8 2021-05-17 [1] RSPM (R 4.1.0)
## arulesViz * 1.5-0 2021-05-21 [1] RSPM (R 4.1.0)
## assertthat 0.2.1 2019-03-21 [1] RSPM (R 4.1.0)
## backports 1.2.1 2020-12-09 [1] RSPM (R 4.1.0)
## bit 4.0.4 2020-08-04 [1] RSPM (R 4.1.0)
## bit64 4.0.5 2020-08-30 [1] RSPM (R 4.1.0)
## bookdown 0.22 2021-04-22 [1] RSPM (R 4.1.0)
## broom 0.7.9 2021-07-27 [1] RSPM (R 4.1.0)
## bslib 0.2.5.1 2021-05-18 [1] RSPM (R 4.1.0)
## cachem 1.0.5 2021-05-15 [1] RSPM (R 4.1.0)
## cellranger 1.1.0 2016-07-27 [1] RSPM (R 4.1.0)
## cli 3.0.1 2021-07-17 [1] RSPM (R 4.1.0)
## codetools 0.2-18 2020-11-04 [2] CRAN (R 4.1.0)
## colorspace 2.0-2 2021-06-24 [1] RSPM (R 4.1.0)
## conflicted * 1.0.4 2019-06-21 [1] RSPM (R 4.1.0)
## cowplot * 1.1.1 2020-12-30 [1] RSPM (R 4.1.0)
## crayon 1.4.1 2021-02-08 [1] RSPM (R 4.1.0)
## crosstalk 1.1.1 2021-01-12 [1] RSPM (R 4.1.0)
## DBI 1.1.1 2021-01-15 [1] RSPM (R 4.1.0)
## dbplyr 2.1.1 2021-04-06 [1] RSPM (R 4.1.0)
## digest 0.6.27 2020-10-24 [1] RSPM (R 4.1.0)
## dplyr * 1.0.7 2021-06-18 [1] RSPM (R 4.1.0)
## DT * 0.18 2021-04-14 [1] RSPM (R 4.1.0)
## ellipsis 0.3.2 2021-04-29 [1] RSPM (R 4.1.0)
## evaluate 0.14 2019-05-28 [1] RSPM (R 4.1.0)
## fansi 0.5.0 2021-05-25 [1] RSPM (R 4.1.0)
## farver 2.1.0 2021-02-28 [1] RSPM (R 4.1.0)
## fastmap 1.1.0 2021-01-25 [1] RSPM (R 4.1.0)
## forcats * 0.5.1 2021-01-27 [1] RSPM (R 4.1.0)
## fs 1.5.0 2020-07-31 [1] RSPM (R 4.1.0)
## furrr * 0.2.3 2021-06-25 [1] RSPM (R 4.1.0)
## future * 1.21.0 2020-12-10 [1] RSPM (R 4.1.0)
## generics 0.1.0 2020-10-31 [1] RSPM (R 4.1.0)
## ggplot2 * 3.3.5 2021-06-25 [1] RSPM (R 4.1.0)
## globals 0.14.0 2020-11-22 [1] RSPM (R 4.1.0)
## glue * 1.4.2 2020-08-27 [1] RSPM (R 4.1.0)
## gtable 0.3.0 2019-03-25 [1] RSPM (R 4.1.0)
## haven 2.4.3 2021-08-04 [1] RSPM (R 4.1.0)
## highr 0.9 2021-04-16 [1] RSPM (R 4.1.0)
## hms 1.1.0 2021-05-17 [1] RSPM (R 4.1.0)
## htmltools 0.5.1.1 2021-01-22 [1] RSPM (R 4.1.0)
## htmlwidgets 1.5.3 2020-12-10 [1] RSPM (R 4.1.0)
## httr 1.4.2 2020-07-20 [1] RSPM (R 4.1.0)
## igraph 1.2.6 2020-10-06 [1] RSPM (R 4.1.0)
## jquerylib 0.1.4 2021-04-26 [1] RSPM (R 4.1.0)
## jsonlite 1.7.2 2020-12-09 [1] RSPM (R 4.1.0)
## knitr 1.33 2021-04-24 [1] RSPM (R 4.1.0)
## labeling 0.4.2 2020-10-20 [1] RSPM (R 4.1.0)
## lattice 0.20-44 2021-05-02 [2] CRAN (R 4.1.0)
## lifecycle 1.0.0 2021-02-15 [1] RSPM (R 4.1.0)
## listenv 0.8.0 2019-12-05 [1] RSPM (R 4.1.0)
## lubridate 1.7.10 2021-02-26 [1] RSPM (R 4.1.0)
## magrittr * 2.0.1 2020-11-17 [1] RSPM (R 4.1.0)
## Matrix * 1.3-3 2021-05-04 [2] CRAN (R 4.1.0)
## modelr 0.1.8 2020-05-19 [1] RSPM (R 4.1.0)
## munsell 0.5.0 2018-06-12 [1] RSPM (R 4.1.0)
## parallelly 1.27.0 2021-07-19 [1] RSPM (R 4.1.0)
## pillar 1.6.2 2021-07-29 [1] RSPM (R 4.1.0)
## pkgconfig 2.0.3 2019-09-22 [1] RSPM (R 4.1.0)
## purrr * 0.3.4 2020-04-17 [1] RSPM (R 4.1.0)
## R6 2.5.0 2020-10-28 [1] RSPM (R 4.1.0)
## Rcpp 1.0.7 2021-07-07 [1] RSPM (R 4.1.0)
## readr * 2.0.0 2021-07-20 [1] RSPM (R 4.1.0)
## readxl 1.3.1 2019-03-13 [1] RSPM (R 4.1.0)
## reprex 2.0.1 2021-08-05 [1] RSPM (R 4.1.0)
## rlang * 0.4.11 2021-04-30 [1] RSPM (R 4.1.0)
## rmarkdown 2.10 2021-08-06 [1] RSPM (R 4.1.0)
## rmdformats 1.0.2 2021-04-19 [1] RSPM (R 4.1.0)
## rstudioapi 0.13 2020-11-12 [1] RSPM (R 4.1.0)
## rvest 1.0.1 2021-07-26 [1] RSPM (R 4.1.0)
## sass 0.4.0 2021-05-12 [1] RSPM (R 4.1.0)
## scales * 1.1.1 2020-05-11 [1] RSPM (R 4.1.0)
## sessioninfo 1.1.1 2018-11-05 [1] RSPM (R 4.1.0)
## stringi 1.7.3 2021-07-16 [1] RSPM (R 4.1.0)
## stringr * 1.4.0 2019-02-10 [1] RSPM (R 4.1.0)
## tibble * 3.1.3 2021-07-23 [1] RSPM (R 4.1.0)
## tidygraph * 1.2.0 2020-05-12 [1] RSPM (R 4.1.0)
## tidyr * 1.1.3 2021-03-03 [1] RSPM (R 4.1.0)
## tidyselect 1.1.1 2021-04-30 [1] RSPM (R 4.1.0)
## tidyverse * 1.3.1 2021-04-15 [1] RSPM (R 4.1.0)
## tzdb 0.1.2 2021-07-20 [1] RSPM (R 4.1.0)
## utf8 1.2.2 2021-07-24 [1] RSPM (R 4.1.0)
## vctrs 0.3.8 2021-04-29 [1] RSPM (R 4.1.0)
## visNetwork 2.0.9 2019-12-06 [1] RSPM (R 4.1.0)
## vroom 1.5.4 2021-08-05 [1] RSPM (R 4.1.0)
## withr 2.4.2 2021-04-18 [1] RSPM (R 4.1.0)
## xfun 0.25 2021-08-06 [1] RSPM (R 4.1.0)
## xml2 1.3.2 2020-04-23 [1] RSPM (R 4.1.0)
## yaml 2.2.1 2020-02-01 [1] RSPM (R 4.1.0)
##
## [1] /usr/local/lib/R/site-library
## [2] /usr/local/lib/R/library